tkattkat

part of STG-653

why

Adds more evals for the agent

what changed

Added ~15 new evals

test plan

  • tested locally
  • tested on browserbase

changeset-bot bot commented Aug 12, 2025

🦋 Changeset detected

Latest commit: a824aa6

The changes in this PR will be included in the next version bump.

@greptile-apps greptile-apps bot left a comment

Greptile Summary

This PR adds 17 new agent evaluation tasks to the Stagehand evaluation suite as part of STG-653. The new evaluations test the AI agent's capabilities across diverse real-world scenarios including e-commerce (Amazon shoes, Google Shopping, UberEats), entertainment platforms (Steam Games, Apple TV), research tools (arXiv, Hugging Face, WolframAlpha), and various web services (GitHub, Google Maps, NBA trades on ESPN, hotel booking).

All new evaluation files follow the established pattern from the existing agent evaluation framework:

  • Navigate to target website using stagehand.page.goto()
  • Create an agent with dynamic provider selection based on model name (Claude uses "anthropic", others use "openai")
  • Execute specific instructions with defined step limits (typically 14-30 steps)
  • Evaluate success based on agentResult.success property
  • Include proper error handling, logging, and resource cleanup
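
The pattern in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not code from the PR: `StagehandLike`, `AgentLike`, `providerFor`, and `runAgentEval` are hypothetical stand-ins, the returned object shape is made up for the example, and the substring check for Claude models is an assumption about how the provider is chosen from the model name.

```typescript
// Hypothetical stand-ins for the objects an eval task works with.
interface AgentResult {
  success: boolean;
}
interface AgentLike {
  execute(options: { instruction: string; maxSteps: number }): Promise<AgentResult>;
}
interface StagehandLike {
  page: { goto(url: string): Promise<unknown> };
  agent(config: { provider: string; model: string }): AgentLike;
  close(): Promise<void>;
}

// Provider selection as the summary describes it:
// Claude models use "anthropic", all others use "openai".
// Matching on the substring "claude" is an assumption.
function providerFor(modelName: string): "anthropic" | "openai" {
  return modelName.toLowerCase().includes("claude") ? "anthropic" : "openai";
}

// One eval task: navigate, create the agent, execute with a step limit,
// report success from agentResult.success, and always clean up.
async function runAgentEval(
  stagehand: StagehandLike,
  modelName: string,
  url: string,
  instruction: string,
  maxSteps: number,
): Promise<{ success: boolean; error?: string }> {
  try {
    await stagehand.page.goto(url);
    const agent = stagehand.agent({ provider: providerFor(modelName), model: modelName });
    const agentResult = await agent.execute({ instruction, maxSteps });
    return { success: agentResult.success };
  } catch (err) {
    // Any thrown error counts as a failed eval.
    return { success: false, error: err instanceof Error ? err.message : String(err) };
  } finally {
    await stagehand.close();
  }
}
```

Relying only on `agentResult.success` is what the review flags as weak validation: the sketch reports whatever the agent claims and does not inspect the resulting page.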

The tasks are added to evals.config.json under the 'agent' category, integrating them into the existing evaluation pipeline. These evaluations expand test coverage to validate agent performance across complex multi-step workflows like checkout processes, search filtering, information extraction, and form filling on production websites.

Confidence score: 3/5

  • This PR requires careful review due to several evaluation quality issues and potential risks from using production websites
  • Score lowered due to lack of proper result validation in most tasks, reliance on production sites that may change, and some logical flaws in evaluation criteria
  • Pay close attention to evals/tasks/agent/kith.ts for payment form risks, evals/tasks/agent/hotel_booking.ts for validation gaps, and the formatting issue in evals.config.json

17 files reviewed, 6 comments

@tkattkat tkattkat changed the title from "More evals" to "Add more ore agent evals STG-653" Aug 12, 2025
@tkattkat tkattkat requested a review from seanmcguire12 August 12, 2025 21:37
@tkattkat tkattkat marked this pull request as draft August 13, 2025 00:10
@tkattkat tkattkat marked this pull request as ready for review August 13, 2025 00:24

@greptile-apps greptile-apps bot left a comment

Greptile Summary

This review covers only the changes made since the last review (commit e71810e), not the entire PR.

The most recent changes complete a major refactoring of the agent evaluation system by centralizing agent initialization logic. The key changes include:

  1. Agent initialization centralization: All agent evaluation functions have been updated to receive a pre-configured agent parameter instead of creating their own agent instances. This eliminates the duplicate model selection and provider mapping logic that was scattered across individual evaluation files.

  2. Type system updates: The StagehandInitResult type in types/evals.ts now includes an agent property using ReturnType<Stagehand["agent"]>, enabling evaluation functions to access agent functionality through dependency injection.

  3. Centralized configuration: The initStagehand.ts file now includes Computer Use Agent (CUA) model detection logic that automatically determines if a model supports computer use capabilities (checking for 'computer-use-preview' or models starting with 'claude') and creates appropriate agent configurations with proper provider mapping.

  4. Standardized evaluation pattern: All ~20 agent evaluation files now follow a consistent pattern where they receive a pre-initialized agent, execute instructions using agent.execute(), and validate results based on agentResult.success. This creates uniformity across the evaluation suite.

The refactoring moves from a decentralized approach where each evaluation file handled its own agent setup to a centralized dependency injection pattern. This architectural change reduces code duplication, ensures consistent agent configuration across all evaluations, and provides better maintainability for global agent behavior modifications. The changes integrate with the existing evaluation framework by extending the StagehandInitResult interface and updating the initialization flow to provide agent functionality to evaluation tasks.
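
A rough sketch of the centralized detection and injection described above, under stated assumptions: `isCuaModel`, `agentConfigFor`, `exampleTask`, and the interfaces are hypothetical names, not the contents of `initStagehand.ts`. The summary does not say what happens for non-CUA models, so returning `undefined` to mean "use a default agent" is an assumption.

```typescript
// Hypothetical shapes; the real Stagehand types may differ.
type Provider = "anthropic" | "openai";
interface AgentConfig {
  provider: Provider;
  model: string;
}
interface AgentResult {
  success: boolean;
}
interface AgentLike {
  execute(options: { instruction: string; maxSteps: number }): Promise<AgentResult>;
}

// CUA detection as the summary describes it: the name contains
// "computer-use-preview" or starts with "claude".
function isCuaModel(modelName: string): boolean {
  return modelName.includes("computer-use-preview") || modelName.startsWith("claude");
}

// Build a provider-specific config for CUA models.
// Returning undefined for other models is an assumption of this sketch.
function agentConfigFor(modelName: string): AgentConfig | undefined {
  if (!isCuaModel(modelName)) return undefined;
  const provider: Provider = modelName.startsWith("claude") ? "anthropic" : "openai";
  return { provider, model: modelName };
}

// With dependency injection, a task receives the pre-built agent
// and no longer contains any model or provider logic.
async function exampleTask(agent: AgentLike): Promise<{ success: boolean }> {
  const agentResult = await agent.execute({
    instruction: "Describe the page heading", // placeholder instruction
    maxSteps: 20,
  });
  return { success: agentResult.success };
}
```

Keeping the model check in one function means a change to provider mapping touches a single file instead of every eval task.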

Confidence score: 4/5

  • This PR is safe to merge with good architectural improvements and consistent patterns
  • Score reflects clean refactoring with proper type safety, though some evaluations lack robust result validation
  • Pay close attention to files with time-dependent instructions or weak validation logic

29 files reviewed, 4 comments

@tkattkat tkattkat changed the title from "Add more ore agent evals STG-653" to "Add more agent evals STG-653" Aug 13, 2025
@miguelg719 miguelg719 changed the title from "Add more agent evals STG-653" to "Add more agent evals" Aug 13, 2025
@tkattkat tkattkat requested a review from seanmcguire12 August 13, 2025 22:19
types/evals.ts Outdated
@@ -13,6 +13,7 @@ export type StagehandInitResult = {
   sessionUrl: string;
   stagehandConfig: ConstructorParams;
   modelName: AvailableModel;
+  agent: ReturnType<Stagehand["agent"]>;

Do we have an actual agent type?
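
For context on the question, here is a small sketch of what `ReturnType<Stagehand["agent"]>` does and how a named alias could stand in for it. `StagehandStub`, `AgentInstance`, and `InitResultSketch` are hypothetical and exist only so the example is self-contained. Whether Stagehand exports a dedicated agent type is not stated in this thread.

```typescript
// Stand-in for the Stagehand class; only the shape of agent() matters here.
class StagehandStub {
  agent(config?: { model?: string }) {
    return {
      model: config?.model ?? "default",
      execute: async (instruction: string) => ({ success: instruction.length > 0 }),
    };
  }
}

// Indexing the class type by "agent" yields the method's type;
// ReturnType then extracts whatever that method returns.
type AgentInstance = ReturnType<StagehandStub["agent"]>;

// A named alias keeps the init result type readable and gives
// other files one name to import.
type InitResultSketch = {
  modelName: string;
  agent: AgentInstance;
};

const initSketch: InitResultSketch = {
  modelName: "example-model",
  agent: new StagehandStub().agent(),
};
```

The derived type tracks the method's return type automatically, while an explicitly exported type would document intent and survive internal refactors of `agent()`.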

@tkattkat tkattkat merged commit 8422828 into main Aug 19, 2025
14 checks passed
@github-actions github-actions bot mentioned this pull request Aug 14, 2025
miguelg719 pushed a commit that referenced this pull request Aug 19, 2025
This PR was opened by the [Changesets
release](https://github.com/changesets/action) GitHub action. When
you're ready to do a release, you can merge this and the packages will
be published to npm automatically. If you're not ready to do a release
yet, that's fine, whenever you add more changesets to main, this PR will
be updated.


# Releases
## @browserbasehq/[email protected]

### Patch Changes

- [#951](#951)
[`f45afdc`](f45afdc)
Thanks [@miguelg719](https://github.com/miguelg719)! - Patch GPT-5 new
api format

- [#954](#954)
[`261bba4`](261bba4)
Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - add support
for shadow DOMs (open & closed mode) when experimental: true

- [#944](#944)
[`8de7bd8`](8de7bd8)
Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - Bump zod
version compatibility and add pathing spec

- [#919](#919)
[`3d80421`](3d80421)
Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - enable
scrolling inside of iframes

- [#963](#963)
[`0ead63d`](0ead63d)
Thanks [@tkattkat](https://github.com/tkattkat)! - Properly handle
images in evaluator + clean up response parsing logic

- [#961](#961)
[`8422828`](8422828)
Thanks [@tkattkat](https://github.com/tkattkat)! - Add more evals for
stagehand agent

- [#946](#946)
[`b769206`](b769206)
Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - fix: unable
to act on/get content from some same process iframes

- [#962](#962)
[`72d2683`](72d2683)
Thanks [@seanmcguire12](https://github.com/seanmcguire12)! - handle
namespaced elements in xpath build step

## @browserbasehq/[email protected]

### Patch Changes

- Updated dependencies
\[[`f45afdc`](f45afdc),
[`261bba4`](261bba4),
[`8de7bd8`](8de7bd8),
[`3d80421`](3d80421),
[`0ead63d`](0ead63d),
[`8422828`](8422828),
[`b769206`](b769206),
[`72d2683`](72d2683)]:
    -   @browserbasehq/[email protected]

## @browserbasehq/[email protected]

### Patch Changes

- Updated dependencies
\[[`f45afdc`](f45afdc),
[`261bba4`](261bba4),
[`8de7bd8`](8de7bd8),
[`3d80421`](3d80421),
[`0ead63d`](0ead63d),
[`8422828`](8422828),
[`b769206`](b769206),
[`72d2683`](72d2683)]:
    -   @browserbasehq/[email protected]

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>